[00:02.530 --> 00:08.710]  Welcome to the last talk of the AI village from Vahid.
[00:11.710 --> 00:14.050]  I'll just let him take it away.
[00:19.410 --> 00:23.750]  Hello and thank you for joining this talk on the security issues and challenges in deep
[00:23.750 --> 00:29.190]  reinforcement learning. I'm Vahid Behzadan, I'm an assistant professor of computer science and
[00:29.190 --> 00:32.550]  data science at the University of New Haven and I also direct the Secure and Assured
[00:32.550 --> 00:38.330]  Intelligent Learning Lab, or SAIL for short, working on AI safety, security and applications
[00:38.330 --> 00:43.410]  of machine learning to cybersecurity and safety of complex systems.
[00:45.710 --> 00:51.310]  So here's the outline of my talk. I'm going to quickly go over the basics of deep reinforcement
[00:51.310 --> 00:56.850]  learning and reinforcement learning. There will be some math but I'll keep it to a minimum.
[00:56.850 --> 01:01.310]  Then we'll talk about vulnerabilities of deep reinforcement learning and whether
[01:01.310 --> 01:08.070]  deep RL is susceptible to classical adversarial machine learning attacks like adversarial examples
[01:08.070 --> 01:13.650]  and such. We'll talk about and develop a threat model for deep reinforcement learning. We'll
[01:13.650 --> 01:22.590]  identify different attack models, attack surfaces, we'll identify different types of
[01:22.590 --> 01:32.590]  vulnerabilities, and we'll introduce a number of attack mechanisms and corresponding defenses
[01:32.590 --> 01:41.750]  that have been developed in recent years. And we'll talk about the frontiers and areas of future
[01:41.750 --> 01:50.230]  research and work in this area. So just a quick overview. I assume most of the audience here is
[01:50.230 --> 01:57.030]  already familiar with the terminology used here. We can classify machine learning algorithms as
[01:57.030 --> 02:01.530]  supervised, unsupervised, and reinforcement learning algorithms. Supervised learning
[02:01.530 --> 02:07.450]  algorithms are those where the training data set includes labeled data, meaning that
[02:08.590 --> 02:15.390]  each data point in that data set comes with the correct label or the correct output expected from
[02:15.650 --> 02:22.690]  a model trained on that data set. Then there is unsupervised learning, under which clustering,
[02:22.690 --> 02:27.910]  anomaly detection, and some other algorithms fall, and where there are no labels available
[02:27.910 --> 02:34.670]  for data points. There's no feedback available on the data. However, the goal is to find some
[02:34.670 --> 02:42.870]  underlying structure in the data. And then there is reinforcement learning, which is concerned with
[02:43.350 --> 02:50.130]  the problem of sequential decision making. There's a reward system included in reinforcement learning,
[02:50.130 --> 02:55.550]  but there are fundamental differences between the settings of reinforcement learning and supervised
[02:56.070 --> 03:01.410]  learning, and that's why it merits its own category. We'll talk about those differences
[03:01.410 --> 03:08.750]  in the next few slides. So you can see the general settings of reinforcement learning,
[03:08.750 --> 03:18.910]  or, more generally, of sensorimotor agent problems. In RL, we have an agent which is to be
[03:18.910 --> 03:23.870]  trained by a reinforcement learning algorithm. This agent interacts with the environment by
[03:23.870 --> 03:30.770]  performing some action. This action causes the state of the environment to change. Then the new
[03:30.770 --> 03:38.110]  state is observed by the agent. There's also some sort of reward associated with this change,
[03:38.810 --> 03:43.730]  provided to or inferred by the agent based on this change in state.
[03:45.790 --> 03:52.570]  Well, you can think of this in terms of playing a game. The environment can be the game environment
[03:52.570 --> 04:01.750]  where different actions may result in scoring or loss of score, and the actions are, well,
[04:01.750 --> 04:06.370]  essentially the actions that the player can take. The state of the environment can be the state of
[04:06.370 --> 04:14.530]  the game. For example, in Breakout, the state can include the configuration of the bricks,
[04:14.530 --> 04:19.010]  where the agent is, and where the ball is. If you recall the game of Breakout, of course, if you
[04:19.010 --> 04:26.070]  are old enough to remember Breakout, you're probably familiar with the dynamics. But you can
[04:26.070 --> 04:33.710]  think of any other game, and this setting still applies. What's the goal here? The goal is to
[04:33.710 --> 04:40.570]  learn how to take actions in order to maximize cumulative rewards, not instantaneous rewards,
[04:40.570 --> 04:48.510]  but cumulative rewards. So an RL agent is not just concerned with maximizing the current score.
[04:48.510 --> 04:58.070]  It wants to learn how to act so that at the end of the game, its total sum of rewards is maximized.
[04:58.070 --> 05:05.490]  What are the different applications of RL? Of course, game playing is well publicized because
[05:05.490 --> 05:11.370]  it's a very good testbed, and it has become one of the common testbeds and experimental
[05:12.250 --> 05:18.230]  settings for RL research. However, there are some major real-world applications for reinforcement
[05:18.230 --> 05:23.290]  learning. In essence, RL, or reinforcement learning, is the machine learning response to
[05:24.010 --> 05:30.730]  the need for data-driven solutions to control problems. Where do we encounter such problems? Well,
[05:30.730 --> 05:34.850]  in robotics, like autonomous navigation, object manipulation, and such. Algorithmic trading,
[05:34.850 --> 05:39.970]  this is one of the areas that my research group has recently become active in. Critical infrastructure
[05:39.970 --> 05:44.530]  control, resource allocation in smart cities, traffic management, intelligent
[05:44.530 --> 05:50.630]  traffic systems, smart grid management, and control. Healthcare, such as clinical decision
[05:50.630 --> 05:58.510]  making, this is actually one of the better known applications of RL in the real world.
[05:58.670 --> 06:04.140]  And other types of resource management and applications in operations research and such.
[06:04.770 --> 06:13.430]  So, RL is either envisioned or is already heavily adopted in various industries,
[06:13.430 --> 06:20.250]  many of which are critical and may become targets of malicious actions. Let's formalize
[06:20.250 --> 06:25.330]  the RL problem a little more. I'm just going to quickly go over this. There will be some math,
[06:25.330 --> 06:32.010]  but this is just to introduce the basic settings. The underlying framework to formulate the RL
[06:32.010 --> 06:37.030]  problem and provide a framework to think about and reason about the RL problem is the Markov
[06:37.030 --> 06:40.850]  decision process. Why do we call it the Markov decision process? Because it's based on the
[06:40.850 --> 06:45.470]  Markovian assumption, or the Markov property, which says the current state completely characterizes
[06:45.470 --> 06:52.050]  the state of the world. So, if you have the current state and you perform an action,
[06:52.050 --> 06:59.410]  the next state is only going to depend or be a function of the previous state and nothing
[06:59.410 --> 07:03.350]  before it. You don't need to know the history of the environment. All you need to know is what the
[07:03.350 --> 07:11.130]  previous state has been to infer what the coming state is going to be. Now, in Markov decision
[07:11.130 --> 07:19.490]  processes, a problem or setting is formulated by a tuple of five main parameters. One is the
[07:19.490 --> 07:26.830]  set of possible states, S. Set of possible actions, A. Distribution of reward given state
[07:26.830 --> 07:31.090]  action pair, or R. This distribution can also be a function. It doesn't necessarily have to
[07:31.090 --> 07:36.270]  be probabilistic. It can be just a function that tells you that if, at a particular state s_i,
[07:36.270 --> 07:43.410]  an action a_i is taken, then the reward is going to be, let's say, r_i. Then there's transition
[07:43.410 --> 07:54.910]  probability, or transition dynamics, P, which represents the dynamics of the environment.
[07:54.910 --> 07:59.810]  Which state are we going to end up in if we are in a particular state and we perform a particular
[07:59.810 --> 08:06.050]  action. And there's finally a discount factor which defines how myopic the agent is, how much
[08:06.050 --> 08:16.090]  it values rewards that may occur further in the future, further down the line. Now, in general,
[08:16.090 --> 08:23.210]  the RL setting is based on the following process. At time step t = 0, in the beginning,
[08:23.210 --> 08:30.690]  the environment samples the initial state s_0. Then, starting from t = 0, this is a loop: until done,
[08:30.690 --> 08:34.910]  the agent selects action a_t based on some criteria. It can be completely random in the beginning and
[08:34.910 --> 08:41.730]  then slowly becomes more targeted and more policy-driven. It performs some action,
[08:41.730 --> 08:49.430]  the environment produces a reward signal, and the next state, and the agent receives those signals.
[08:49.430 --> 08:55.370]  It may be partial, it may be incomplete, or it may be noisy, but the agent receives some signal
[08:56.390 --> 09:04.070]  resulting from the transition to the new state s_{t+1}, and a reward that has resulted from
[09:04.070 --> 09:11.270]  that transition. Now, we define a new entity here
[09:11.270 --> 09:17.770]  called pi. A policy pi is a function, or a distribution in the probabilistic case, that
[09:17.770 --> 09:27.370]  maps from state to action. So, it tells the agent what action to perform given any state.
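The interaction loop and the policy just described can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the `ToyCorridor` environment, the `random_policy` function, and the `run_episode` helper are all hypothetical names chosen for the example, and the policy is purely random, as it might be at the very start of training.

```python
import random


class ToyCorridor:
    """Hypothetical 1-D corridor with states 0..4; reaching state 4 ends the episode."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # The environment provides the initial state s_0 (fixed to 0 here).
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, action 0 moves left (clamped to the corridor).
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    # A policy maps a state to an action; this one ignores the state entirely.
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=1000):
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state, rng)                  # agent selects a_t
        next_state, reward, done = env.step(action)  # environment returns r_t and s_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


rng = random.Random(0)
trajectory = run_episode(ToyCorridor(), random_policy, rng)
```

Each tuple in `trajectory` is one (state, action, reward, next state) transition, which is the raw experience an RL algorithm would learn from.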
[09:27.870 --> 09:33.730]  And the objective is to find the optimal policy pi star that maximizes the cumulative
[09:33.730 --> 09:42.270]  discounted reward. We also call this return. So, cumulative, meaning the sum of rewards
[09:42.270 --> 09:47.850]  in the entire duration of interaction. It can be one episode of a game, it can be throughout the
[09:48.430 --> 09:55.390]  training period or training horizon, and it's discounted. This gamma is the discount factor
[09:55.390 --> 10:00.750]  that defines how myopic the agent is, or how much it values events that occur further down the line.
[10:00.750 --> 10:08.070]  It's typically a constant value between zero and one, and as you can see, as t increases,
[10:08.070 --> 10:18.770]  this gamma to power t decreases. So, the value, the preference, or the observed value of something
[10:18.770 --> 10:25.890]  that happens further down the line, which means with greater t, is going to decrease as t increases.
[10:26.650 --> 10:32.670]  So, again, the objective is to find the optimal policy pi star that maximizes this sum here.
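As a small worked example of the discounted sum described above, the following sketch computes the return for a finite list of rewards. The function name `discounted_return` is an assumption made for illustration; the formula it implements is the sum of gamma to the power t times the reward at step t.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# With gamma = 0.5, three rewards of 1 are worth 1 + 0.5 + 0.25.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

A gamma close to zero makes the agent myopic (only the first reward matters), while a gamma close to one weighs distant rewards almost as much as immediate ones.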
[10:32.670 --> 10:39.510]  All right, we are going to quickly introduce two other definitions here. If we want to evaluate
[10:39.670 --> 10:47.010]  a particular state in these settings, one approach is to measure the value of that state using the
[10:47.010 --> 10:54.170]  value function. The value function of state s is the expected cumulative reward from following
[10:54.170 --> 11:01.430]  the policy pi from this state. So, let's assume we already have a policy. What will be the expected
[11:01.430 --> 11:09.750]  cumulative reward if you start in state s and keep following policy pi? Now, we also define
[11:10.310 --> 11:19.210]  a Q value, or a Q function, which tells us how good is a particular action A if it's performed
[11:19.210 --> 11:26.650]  in state s. In other words, if we are in state s, if the agent is in state s and performs action A,
[11:26.650 --> 11:34.990]  the Q value of this setting is the expected cumulative reward from taking action A in state
[11:34.990 --> 11:40.530]  s and then following the policy. Remember, the policy is a function that tells the agent what
[11:40.530 --> 11:45.910]  action to perform given any state. When the agent performs action A in state s, it goes into a state
[11:45.910 --> 11:53.250]  s prime, or the next state, and then we can use this policy function here to see what the action
[11:53.250 --> 11:58.090]  should be, what action should the agent take at that state, which takes it to another state,
[11:58.090 --> 12:02.070]  and then the policy is used to figure out the action to take in that state, and this goes on
[12:02.070 --> 12:07.650]  until the termination of the episode or the horizon. Now that we are familiar with the value
[12:07.650 --> 12:23.230]  function and the Q function, let's go a bit deeper. The policy pi, the value function,
[12:23.230 --> 12:29.070]  or the reward model are all functions. We want to learn at least one of these from experience.
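For reference, the value function and the Q function described earlier can be written compactly as follows. The notation is the standard one and is an assumption about what the slides show: r_t is the reward at step t, gamma is the discount factor, and the expectation is over trajectories generated by following the policy pi.

```latex
V^{\pi}(s) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\ \pi \right]
\qquad
Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\ a_{0} = a,\ \pi \right]
```

The only difference between the two is that Q fixes the first action a before the policy takes over.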
[12:29.070 --> 12:35.510]  This is the essence of RL. If there are too many states, however, we cannot just
[12:36.670 --> 12:42.830]  tabulate everything and try to experience every state and the corresponding result from that state
[12:42.830 --> 12:47.630]  or that state and any action performed, because as the size of the problem, as the state space
[12:47.630 --> 12:54.070]  and the action space increase in settings like, let's say, playing GTA 5 or a self-driving
[12:54.830 --> 13:01.090]  navigation policy in the real world, these state spaces just explode. Dimensionality
[13:01.630 --> 13:06.710]  is too high and it's just not feasible to store every possible state. In those cases,
[13:06.710 --> 13:11.190]  we need to approximate. In general and traditionally, this is called RL with
[13:11.190 --> 13:17.610]  function approximation. If function approximation is done using deep neural networks,
[13:17.610 --> 13:24.450]  then we call the training setting or the RL setting deep reinforcement learning. The term
[13:24.450 --> 13:31.450]  is relatively new. It was, I believe, introduced in late 2013, early 2014 by Mnih and David Silver
[13:31.450 --> 13:37.190]  and others at DeepMind in their deep Q-learning paper. However, the concept is not very new.
[13:37.190 --> 13:43.630]  That said, there have been fantastic and even mind-blowing advances in this area in the past
[13:43.630 --> 13:48.690]  three or four years. We'll talk about some of those as we move forward. A quick overview of
[13:48.690 --> 13:54.250]  the taxonomy of different RL approaches and RL agents. Remember, we have value function, policy,
[13:54.250 --> 13:59.690]  and model. There are different approaches to solving the RL problem. Sometimes we just want to
[14:00.430 --> 14:10.990]  find a policy directly. The approaches that respond or satisfy that need are called policy-based RL.
[14:10.990 --> 14:16.390]  Sometimes we want to find a Q function or a V function first and then derive the policy
[14:16.390 --> 14:22.430]  from those functions. These approaches are called value-based. Sometimes we want to first learn the
[14:22.430 --> 14:27.230]  model of the environment and then solve the problem. These are called model-based approaches.
[14:27.230 --> 14:30.620]  Sometimes we don't have the model and we don't want to learn the model explicitly,
[14:31.250 --> 14:37.150]  and those are called model-free. When we're dealing with both value function and policy at the same
[14:37.150 --> 14:41.730]  time, in other words, we have one agent learning the value function, another learning the policy,
[14:41.730 --> 14:50.910]  and then contrasting those with each other in a kind of zero-sum setting. We call those actor-critic
[14:51.970 --> 14:56.430]  RL approaches. As you can see, there are different approaches to different settings and different
[14:56.430 --> 15:01.910]  problems. However, all of those can still be grounded on top of the Markov decision process
[15:01.910 --> 15:07.830]  framework and the general solution approach we looked at before. One of the better-known
[15:07.830 --> 15:15.090]  approaches to RL, which falls under the value-based approaches, is Q-learning. The objective in Q-
[15:15.090 --> 15:22.170]  learning is to derive the optimal policy pi star based on optimal Q. So the optimal Q is one that
[15:22.170 --> 15:35.170]  maximizes the value of each state and action. And from that point onward, with an iterative
[15:35.170 --> 15:41.890]  formulation based on Bellman equations or dynamic programming, it becomes possible to find this
[15:41.890 --> 15:50.790]  through bootstrapping and iterative re-estimation of the Q value. There are different approaches to
[15:50.790 --> 15:58.180]  Q-learning in large state spaces or large action spaces where function approximation is required.
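Before moving to function approximation, the bootstrapped, iterative re-estimation of Q can be sketched in its tabular form. This is a minimal sketch under stated assumptions: the five-state chain, the `step` function, and all hyperparameter values are hypothetical and chosen only to keep the example small; the update rule itself is the standard Q-learning update.

```python
import random

N_STATES, N_ACTIONS = 5, 2   # hypothetical chain: action 1 moves right, action 0 moves left
GOAL = N_STATES - 1


def step(state, action):
    next_state = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)      # explore
            else:
                best = max(q[state])                   # exploit, breaking ties at random
                action = rng.choice([a for a in range(N_ACTIONS) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Bellman target: immediate reward plus discounted best next-state value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
greedy_policy = [max(range(N_ACTIONS), key=lambda a: q_table[s][a]) for s in range(GOAL)]
```

The policy is never learned directly here: it is derived from the learned Q table by taking the action with the highest value in each state, which is what makes this a value-based approach.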
[15:58.690 --> 16:05.780]  One such approach is to parametrize Q as a function of s, a, and an estimation parameter theta.
[16:07.070 --> 16:15.170]  If we represent this parametrized Q function with a neural network, where theta corresponds to the
[16:15.170 --> 16:23.450]  weights of the neural network, the approach is called Q-networks. This solution approach was proposed
[16:24.950 --> 16:32.770]  very early in the 2000s, or even earlier, but was perfected in some sense in 2013-2014 by
[16:32.770 --> 16:40.250]  DeepMind, by David Silver and his team, which resulted in the proposal of deep Q-networks, or
[16:40.250 --> 16:47.270]  DQNs. So we're talking about deep networks. What does that mean? It means that these networks use
[16:47.270 --> 16:53.010]  deep neural networks, like CNNs, which help with both function approximation
[16:53.690 --> 17:01.810]  and also end-to-end feature learning. One of the advantages of deep learning is its
[17:01.810 --> 17:07.730]  superior performance in learning feature representations, especially from images.
[17:08.630 --> 17:15.430]  Now, for those of you who are more statistically oriented, you may have seen the term IID.
[17:15.730 --> 17:22.070]  It means that the data are independent and identically
[17:22.070 --> 17:30.770]  distributed. It means that one data point does not depend on another data point, and also
[17:30.770 --> 17:37.630]  each data point is drawn from the same distribution. These do not hold for the RL settings.
[17:37.630 --> 17:44.170]  A particular state, for example, is highly correlated with its previous state and action.
[17:44.170 --> 17:49.730]  Why does this matter? Well, a lot of our supervised and deep learning approaches
[17:50.570 --> 17:58.550]  are based on the assumption that the training data is IID. Because this does not hold in RL,
[17:58.550 --> 18:06.850]  and in response to this problem, the DQN approach introduced
[18:06.850 --> 18:12.810]  experience replay. It's like a bag of all past data, which is randomly sampled in each training
[18:12.810 --> 18:18.530]  iteration to reduce the correlation, the sequential correlation, the temporal correlation
[18:18.530 --> 18:26.830]  of data points, and also make it more likely for data to be evenly distributed. Also, to
[18:27.490 --> 18:35.450]  reduce the effect of oscillation during training, DQN uses fixed parameters for a target network.
[18:35.450 --> 18:42.790]  The target of optimization is fixed and is updated every few thousand iterations,
[18:42.790 --> 18:46.270]  so that reduces the oscillation problem. And, of course, the rewards are normalized (clipped) to
[18:46.270 --> 18:53.770]  minus one to one to make sure the reward signals are bounded.
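The three stabilizing ingredients just mentioned (experience replay, a periodically updated target network, and bounded rewards) can be sketched as follows. This is an illustrative sketch rather than the DQN implementation: the `ReplayBuffer` class, the `clip_reward` and `maybe_sync_target` helpers, and the use of plain dictionaries as stand-ins for network parameters are all assumptions made to keep the example free of any deep learning library.

```python
import random
from collections import deque


def clip_reward(reward, low=-1.0, high=1.0):
    """Bound the reward signal to [low, high]."""
    return max(low, min(high, reward))


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)   # oldest transitions are dropped first

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, clip_reward(reward), next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the temporal correlation between consecutive transitions.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def maybe_sync_target(online_params, target_params, step, sync_every):
    """Copy the online parameters into the frozen target every `sync_every` steps."""
    if step % sync_every == 0:
        target_params.clear()
        target_params.update(online_params)
        return True
    return False
```

In a training loop, each new transition would go through `add`, each gradient step would draw a mini-batch with `sample`, and `maybe_sync_target` would be called once per step so that the optimization target only moves every few thousand iterations.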
[18:54.930 --> 19:00.630]  Now that we have a preliminary understanding of deep reinforcement learning and one of its
[19:00.630 --> 19:06.130]  implementations, one of the popular approaches called DQN, let's take a quick look at adversarial
[19:06.130 --> 19:10.090]  machine learning. I assume that by now, those of you who were not familiar with adversarial
[19:10.090 --> 19:15.810]  examples have been introduced to this concept. For the idea behind adversarial examples,
[19:15.810 --> 19:20.590]  we are now speaking in the context of image classifiers and
[19:20.590 --> 19:27.050]  supervised learning. Let's say we have an image classifier trained on a set of images of pandas,
[19:27.050 --> 19:33.290]  cats, and others to identify the object in those images. In the beginning, we pass the image of a
[19:33.290 --> 19:39.590]  panda to the classifier, and it correctly classifies it as panda. Now, it's been demonstrated,
[19:39.590 --> 19:47.490]  and it's actually been well established by now, that it is possible
[19:47.490 --> 19:51.030]  to induce incorrect classifications in deep learning models, or machine learning models in
[19:51.030 --> 19:58.590]  general, by adding minute perturbations to the original image. As you can see,
[19:58.590 --> 20:06.710]  it's almost impossible to detect or see any changes in this final image. These perturbations,
[20:06.710 --> 20:15.890]  these pixel perturbations are very small. Now, this is one example of different attacks or
[20:15.890 --> 20:20.170]  different vulnerabilities in the classical realm of adversarial machine learning, which is mostly
[20:20.170 --> 20:25.870]  concerned with supervised learning and sometimes unsupervised learning. In general, the adversarial
[20:25.870 --> 20:34.350]  objectives in AML or adversarial machine learning can be classified under the traditional CIA
[20:34.970 --> 20:39.010]  triad, confidentiality, integrity, and availability. So, with respect to confidentiality
[20:39.010 --> 20:45.670]  and privacy, an adversary may wish to target the confidentiality of the model parameters,
[20:45.670 --> 20:53.650]  or architecture, since intellectual property theft is an issue in larger models, or it may target the
[20:53.650 --> 20:59.310]  privacy of the training and test data. For example,
[20:59.310 --> 21:04.450]  if medical records were used in the training data, there have been proof-of-concept attacks showing
[21:04.450 --> 21:10.990]  that it is possible to infer whether a particular patient or the records pertaining to a particular
[21:10.990 --> 21:16.290]  patient were used in the training data or not, or sometimes it's possible to reconstruct the
[21:16.290 --> 21:21.590]  training data set by just having access to the model itself. And that is a major HIPAA violation.
[21:22.070 --> 21:28.750]  In essence, it's a privacy violation. Also, with regards to integrity and availability,
[21:28.750 --> 21:33.490]  the attacker may target the integrity of predictions or the outcome of the model,
[21:33.490 --> 21:38.810]  the performance of the model. For example, adversarial examples are an attack on the
[21:38.810 --> 21:44.090]  integrity of image classifiers or supervised machine learning models. And also an adversary
[21:44.090 --> 21:48.790]  may target the availability of the system that is deploying machine learning, for example,
[21:48.870 --> 21:55.170]  a facial recognition system or an autonomous navigation system in a driverless car.
[21:56.130 --> 22:03.230]  Now that we know so much about adversarial machine learning and security vulnerabilities
[22:03.230 --> 22:07.250]  of classical machine learning, supervised learning and unsupervised learning,
[22:08.050 --> 22:14.910]  there is a major question. And it's whether deep RL is immune to those attacks.
[22:14.910 --> 22:20.590]  Back in late 2016, when the research community
[22:20.590 --> 22:25.790]  was just beginning to pay attention to both deep reinforcement learning and the issue of
[22:25.790 --> 22:35.450]  adversarial examples, I came up with this question and decided to experiment with it a little to
[22:35.450 --> 22:42.210]  find out whether deep RL can also be vulnerable to such attacks. I started from a simple observation.
[22:42.210 --> 22:46.790]  The deep neural networks in DQN models and in classifiers are both function approximators:
[22:46.790 --> 22:51.010]  at training time they are function approximators, and at test time they are just functions.
[22:51.810 --> 22:58.230]  And I came up with this hypothesis. If classifiers are vulnerable to adversarial examples,
[22:58.230 --> 23:05.250]  then action value approximators of DQNs may also be vulnerable. All right. So I started to
[23:05.450 --> 23:15.230]  set up an experiment where the aim was to perform adversarial attacks. The adversary's goal
[23:15.850 --> 23:21.110]  was twofold. One was a test-time attack to perturb the performance of the target's learned policy. So
[23:21.110 --> 23:25.670]  the target at this point is fully trained and is deployed in the environment and the adversary
[23:25.670 --> 23:33.810]  wants to somehow cause the agent to perform incorrectly, to manipulate its policy.
[23:34.890 --> 23:39.450]  What does the adversary know about the target? It knows the type of input to the target. For
[23:39.450 --> 23:46.930]  example, it knows whether the target policy is looking at image data, text, audio, and such.
[23:46.930 --> 23:51.050]  Why? Because it helps with estimating the architecture. For example, if it's images,
[23:51.050 --> 23:58.690]  the adversary can come up with a good guess that the target architecture includes CNNs
[23:58.690 --> 24:02.890]  or convolutional neural networks. I also assume that the adversary knows the reward function.
[24:02.890 --> 24:07.210]  So the adversary may have access to the environment and if it's, for example, a game
[24:07.210 --> 24:14.130]  environment, it knows what the scores are, how the scores are generated. What is not known? The
[24:14.130 --> 24:21.710]  target's neural network architecture is not known, so it's a black box attack. Also,
[24:21.710 --> 24:28.150]  the initial parameters, the initialization of the target neural network, are not known.
[24:28.150 --> 24:31.910]  What is available to the adversary in terms of actions? The adversary may perturb
[24:32.550 --> 24:39.610]  the environment where the target performs. For example, it can change pixel values in a game
[24:39.610 --> 24:48.570]  environment through a man-in-the-middle attack. In this work, I considered two techniques for
[24:48.570 --> 24:53.370]  perturbing that environment. One is the classical fast gradient sign method for generating adversarial
[24:53.370 --> 24:57.830]  examples and the other is the Jacobian-based saliency map attack or the JSMA approach
[24:57.830 --> 25:09.490]  introduced by Papernot in 2015, I believe. Now, in that experiment, I used the classical DQN
[25:09.490 --> 25:18.250]  approach, the classical DQN architecture introduced by Mnih, in an Atari game, the game of Pong.
[25:18.450 --> 25:23.350]  And this was all implemented in OpenAI Gym with TensorFlow. Back then, PyTorch wasn't really
[25:23.350 --> 25:30.730]  and trained the agent against heuristic AI. And here's the initial proof of concept result.
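As a rough illustration of the FGSM attack named above, the sketch below perturbs observations so that a greedy agent picks a different action, then measures how often that succeeds. It is not the experiment from the talk: the target there is a convolutional DQN playing Pong, whereas this example assumes a tiny linear Q function whose weights are known to the attacker (a white box setting), and every function name here is hypothetical. The loss being increased is the cross-entropy between a softmax over the Q values and the action the agent currently prefers.

```python
import math


def q_values(weights, observation):
    # Linear stand-in for a Q-network: one row of weights per action.
    return [sum(w * x for w, x in zip(row, observation)) for row in weights]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sign(x):
    return (x > 0) - (x < 0)


def fgsm_perturb(weights, observation, epsilon):
    """One-step FGSM: move each input feature by epsilon in the direction that increases the loss."""
    q = q_values(weights, observation)
    chosen = argmax(q)
    probs = softmax(q)
    # Gradient of the cross-entropy loss w.r.t. each logit: p_k - 1[k == chosen].
    dlogits = [p - (1.0 if k == chosen else 0.0) for k, p in enumerate(probs)]
    # Chain rule through the linear layer gives the gradient w.r.t. the input.
    grad = [sum(weights[k][j] * dlogits[k] for k in range(len(weights)))
            for j in range(len(observation))]
    return [x + epsilon * sign(g) for x, g in zip(observation, grad)]


def attack_success_rate(weights, observations, epsilon):
    """Fraction of observations whose greedy action changes after perturbation."""
    flipped = sum(
        argmax(q_values(weights, fgsm_perturb(weights, obs, epsilon)))
        != argmax(q_values(weights, obs))
        for obs in observations
    )
    return flipped / len(observations)


weights = [[1.0, 0.0], [0.0, 1.0]]
observations = [[0.6, 0.5], [0.9, 0.1]]
rate = attack_success_rate(weights, observations, epsilon=0.1)
```

In this toy case the first observation sits close to the decision boundary and is flipped by a perturbation of 0.1 per feature, while the second is not. A black box attacker without access to `weights` would have to compute the same perturbation on a substitute model instead.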
[25:30.730 --> 25:33.930]  This is for the white box attack. Of course, later on, I will show you the results for the
[25:33.930 --> 25:44.870]  black box attack. And you can see that for FGSM and JSMA, regardless of how far along in training
[25:44.870 --> 25:54.230]  the agent is, the policy for the Pong agent is highly vulnerable
[25:54.230 --> 25:59.630]  to adversarial perturbations through simple techniques like FGSM and JSMA. You can see
[25:59.630 --> 26:05.870]  that for JSMA, the success rate was 100% for all of the cases. For FGSM, it was slightly lower
[26:06.540 --> 26:12.150]  and it was mostly because of the termination criteria and the perturbation threshold that
[26:12.150 --> 26:18.750]  I had defined. But you can see, it can be observed that the policy can be very easily
[26:18.750 --> 26:23.690]  manipulated through adversarial example attacks. So, 100 random observations were
[26:23.690 --> 26:28.810]  perturbed with FGSM and JSMA, the perturbed images were fed to
[26:28.810 --> 26:35.790]  the trained neural network representing the agent's policy as test input, and then the success
[26:35.790 --> 26:40.930]  rate was measured. Now, is this type of attack practical? Is this really something that we should
[26:40.930 --> 26:51.490]  be worried about? A few years later, in 2018, Clark and his co-authors published a report of
[26:51.990 --> 26:59.770]  a similar attack on an autonomous robot with a DQN-based policy that used
[26:59.770 --> 27:06.150]  ultrasonic sensory input for collision avoidance. They showed that they could use adversarial
[27:06.150 --> 27:16.670]  perturbations to manipulate the trajectory of the robot and make it follow a path defined or desired
[27:16.670 --> 27:23.030]  by the adversary, not one that the robot itself wants to follow. There are more recent examples
[27:23.030 --> 27:29.050]  of how this sort of attack on deep RL, or in general, attacks on deep RL, can be of concern.
[27:29.530 --> 27:38.130]  One of the recent works by my graduate students is on attacks on automated trading algorithms
[27:38.130 --> 27:42.750]  based on reinforcement learning. The paper is coming out in a couple months, so I can't go
[27:42.750 --> 27:48.630]  into more details, but this is one of the more severe and urgent cases for security researchers
[27:48.630 --> 27:54.910]  to consider. Deep RL is already being used by many major financial players and stock traders,
[27:54.910 --> 27:58.950]  and it can be easily manipulated in the real world. There are other cases, of course,
[27:58.950 --> 28:05.750]  but this is one of the examples that demonstrates the practicality and the applicability of this
[28:05.750 --> 28:10.370]  attack to real world scenarios. Now, before we go further into different types of attacks,
[28:10.370 --> 28:15.490]  let's develop a threat model for deep reinforcement learning. Again, the adversary's objectives can
[28:15.490 --> 28:20.890]  follow those of the CIA triad. The adversary may wish to access the internal configurations,
[28:20.890 --> 28:26.510]  like model parameters, reward function, policy, and such, to steal the model for intellectual
[28:26.510 --> 28:32.270]  property theft and such. There can be attacks on integrity, which means compromising the desired
[28:32.270 --> 28:38.510]  learning or enactment of the policy. There can be attacks on availability, which are essentially
[28:38.510 --> 28:45.010]  compromises of the ability of the agent to perform training or actions when needed. Now,
[28:45.010 --> 28:50.930]  let's look at the attack surface of deep RL. This is the general block
[28:50.930 --> 28:58.790]  diagram of a DRL agent or DRL system. We have the agent. The agent typically has some memory
[28:58.790 --> 29:03.450]  where it stores its experiences during training, and then those experiences help with function
[29:03.450 --> 29:10.770]  approximation, data-driven policy learning, and such. And then there is an exploration controller,
[29:10.770 --> 29:15.210]  which controls how the agent explores the environment during training. There's an
[29:15.210 --> 29:20.650]  experience selector, which decides how to select experiences from the bank of data or observations stored
[29:20.930 --> 29:27.990]  in this data set. There is an actuator, which then enacts the actions of the agent inside an
[29:27.990 --> 29:33.970]  environment. The environment is connected back to this agent block through an observation channel,
[29:33.970 --> 29:39.390]  observation of the state, and reward channel. And it's of no surprise to the more seasoned
[29:39.390 --> 29:45.910]  security researcher and professional that all of these components can be a subject or target of
[29:45.910 --> 29:54.590]  attack. As we go forward, we'll cover some examples of attacks that can occur on each of
[29:54.590 --> 30:00.190]  these components. We've already seen an attack on the observation channel. There are attacks on the
[30:00.190 --> 30:07.930]  reward channel, which we'll hopefully touch on. And there are attacks on the agents during training
[30:07.930 --> 30:12.290]  and the actuator. Let me give you a quick example of the actuator attack. Let's assume we have a
[30:12.290 --> 30:17.970]  robot, an actual robot, learning to navigate in an environment while avoiding obstacles.
[30:18.130 --> 30:28.590]  If the robot commands or decides to, let's say, move the left wheel forward,
[30:29.250 --> 30:33.250]  but there is some sort of obstacle in front of the left wheel, and the left wheel doesn't actually
[30:33.250 --> 30:40.390]  move, then the actuator, the resulting data is going to be, the resulting observation is going
[30:40.390 --> 30:45.910]  to be skewed, because the agent is going to assume that the actuation has happened and then
[30:45.910 --> 30:52.870]  look at the changes in the observation and use that to retrain its policy, to optimize its policy
[30:52.870 --> 30:57.350]  based on faulty data. What are the adversarial capabilities? Well, we first look at different
[30:57.350 --> 31:04.530]  attack modes. The attacker can perform a passive attack where it's only observing the target,
[31:04.530 --> 31:09.830]  it's not changing anything, or it can perform an active manipulation. In passive attacks,
[31:09.830 --> 31:16.770]  the attacker can perform inverse reinforcement learning to learn about the reward function
[31:17.430 --> 31:23.250]  of the agent, or later on we'll see it can perform imitation learning to steal the policy.
[31:23.830 --> 31:30.730]  Active measures include attacks on the actuation, observation, or the reward channel. And attacks on
[31:30.730 --> 31:36.330]  observation can be targeting the representation model, how the agent sees the environment,
[31:36.330 --> 31:41.150]  or perturbing the transition dynamics, how the agent sees the changes in the environment.
[31:42.050 --> 31:46.290]  Okay, going back to our initial proof of concept, remember our original goal was
[31:46.290 --> 31:59.590]  to perform a black box attack. So, to achieve this objective, we introduce an approach based
[31:59.590 --> 32:05.730]  on the transferability of adversarial examples. So, we create a second DQN, the adversary creates
[32:05.830 --> 32:09.630]  a second DQN with similar architecture, but different initial parameters. When I say similar,
[32:09.630 --> 32:12.810]  it doesn't necessarily have to be a match, it just needs to be a convolutional neural network
[32:12.810 --> 32:18.150]  with some functional approximation techniques, but it doesn't need to have the same parameters
[32:18.150 --> 32:27.470]  or the exact same architecture. And it trains that agent, that model, on the same environment.
[32:27.470 --> 32:34.110]  The assumption is the adversary has access to that environment, and then uses the knowledge of that
[32:35.750 --> 32:41.790]  architecture and the trained replica policy. It crafts adversarial examples the same way it did
[32:41.790 --> 32:46.250]  for the white box case, and we know that many of those adversarial examples can transfer
[32:46.250 --> 32:55.930]  to similar models, trained on similar data, which also you can see applies to DQN policies as well.
[32:55.930 --> 33:02.370]  And this is how we implemented a black box attack against deep RL policies. Now, what about training
[33:02.370 --> 33:10.690]  time attacks? In the same paper we introduced the policy induction attack, where the attack
[33:10.690 --> 33:20.230]  is of the adversarial example type against a DQN agent during training. Now, what are
[33:20.230 --> 33:27.550]  the different steps in this attack? First, the adversary derives an adversarial policy
[33:27.550 --> 33:34.390]  from the adversarial goal by training on the same environment where the target is going to
[33:34.390 --> 33:42.290]  perform or be trained in. So, if the adversary wants to minimize the reward gained in the game
[33:42.290 --> 33:50.250]  by the agent, then the goal, the optimization goal, is going to be the exact opposite of that of the
[33:50.250 --> 33:55.630]  target policy. The target wants to maximize the reward and the adversary wants to minimize it.
[33:55.630 --> 33:58.730]  Of course, there are different ways of formulating this adversarial goal.
[33:59.790 --> 34:08.090]  Then, the adversary creates a replica of target's DQN and initializes it randomly. And then
[34:10.810 --> 34:17.510]  comes the exploitation phase, where the attacker observes the current state and transitions in the
[34:17.510 --> 34:24.590]  environment, then estimates best action according to the adversarial policy derived in step one.
[34:24.590 --> 34:32.210]  Then, the attacker crafts perturbations to induce adversarial action based on the replica of
[34:32.210 --> 34:37.950]  target's DQN. This is exactly the same as our black box test time attack. The attacker applies
[34:37.950 --> 34:42.690]  the perturbation as a man in the middle in the observation channel. The perturbed input is
[34:42.690 --> 34:49.410]  revealed to the target and the attacker waits for target's actions. And this is a loop. This loop can
[34:49.410 --> 34:56.870]  go on until the training process of the target either converges to a suboptimal policy or up to
[34:57.310 --> 35:04.090]  a certain number of iterations. This is a very rough plot. This is not smooth yet, but you can
[35:04.090 --> 35:11.670]  see that the unperturbed agent moves towards... this is in the game of Pong still... moves towards
[35:11.670 --> 35:21.730]  convergence to an optimal total sum of rewards, while the attacked agent moves towards a convergence
[35:21.730 --> 35:29.450]  to the minimum possible return of zero. It's getting closer and closer to zero, which indicates that
[35:29.450 --> 35:35.330]  the training process of DQN and DeepRL in general can also be targeted through adversarial attacks.
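The exploitation step of the policy induction attack described above can be sketched roughly as follows. This is only an illustration under strong assumptions, not the implementation from the paper: the replica is a tiny linear Q-function instead of a DQN, the perturbation is a single signed-gradient (FGSM-style) step, and the names `craft_perturbation`, `attack_step`, and `eps` are hypothetical.

```python
def q_values(weights, obs):
    """Linear Q-function: one weight row per action (stand-in for a DQN)."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]


def greedy(weights, obs):
    """Index of the action with the highest Q-value."""
    q = q_values(weights, obs)
    return q.index(max(q))


def sign(v):
    return (v > 0) - (v < 0)


def craft_perturbation(replica, obs, adv_action, eps):
    """Signed-gradient step that raises Q(adv_action) relative to the
    replica's current greedy action. For a linear Q-function the gradient
    with respect to the observation is just the difference of weight rows."""
    current = greedy(replica, obs)
    if current == adv_action:
        return [0.0] * len(obs)  # replica already picks the adversarial action
    grad = [wa - wc for wa, wc in zip(replica[adv_action], replica[current])]
    return [eps * sign(g) for g in grad]


def attack_step(replica, adversarial_policy, obs, eps):
    """One loop iteration: pick the adversarial action, craft the perturbation
    on the replica, and return the observation shown to the target."""
    adv_action = adversarial_policy(obs)
    delta = craft_perturbation(replica, obs, adv_action, eps)
    return [x + d for x, d in zip(obs, delta)]


replica = [[1.0, 0.0], [0.0, 1.0]]          # hypothetical replica weights
clean_obs = [1.0, 0.9]                      # replica prefers action 0 here
perturbed_obs = attack_step(replica, lambda o: 1, clean_obs, eps=0.1)
```

In the attack as described in the talk, the perturbed observation is what the target receives through the observation channel, and the loop repeats on every training step; whether the perturbation transfers from the replica to the real target is an empirical question that this sketch does not address.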
[35:35.330 --> 35:41.450]  Now, I'm going to introduce another type of training time attack. Again, this aims to
[35:41.450 --> 35:48.350]  induce some form of misbehavior. We call this misbehavior addiction, and this is a follow-up to
[35:48.450 --> 35:54.070]  a work I did with my colleague Roman Yampolskiy on psychopathological modeling of AI safety problems.
[35:54.270 --> 36:01.510]  This is a proof of concept. We consider the game of Snake. Many of you probably remember Snake from
[36:02.330 --> 36:10.570]  older Nokia phones. The DQN agent is a snake and is learning to play in this environment.
[36:10.570 --> 36:16.450]  The attacker adds a drug seed with more instantaneous reward than the typical seed,
[36:16.450 --> 36:22.650]  but it also results in more increase in the length of the tail. You can see that this can end up...
[36:24.730 --> 36:36.570]  well, if the increase in the tail length is more than a certain amount, then a longer-tailed
[36:36.570 --> 36:45.170]  snake is bound to eat its own tail sooner rather than later. So, we show that we actually derive
[36:45.170 --> 36:53.090]  some theoretical closed-form solutions for what the additional reward and increase in the tail
[36:53.090 --> 36:58.610]  length should be for addiction to emerge, meaning that the agent learns a more myopic policy instead
[36:58.610 --> 37:04.050]  of the optimal policy, and you can see that it's actually possible to make the agent addicted
[37:05.430 --> 37:11.550]  to the drug seed, as we call it, and this results in learning a sub-optimal policy.
[37:12.390 --> 37:19.510]  And, of course, due to time limitations, I'm going to introduce only one more type of attack,
[37:19.510 --> 37:26.330]  and it's that of targeting the confidentiality of a deep RL policy.
[37:27.230 --> 37:31.810]  The problem here is, or the question is, is it possible to extract a deep RL policy from
[37:31.810 --> 37:38.830]  observations of its actions? Why does this matter? Well, the security challenge posed by
[37:39.570 --> 37:45.510]  this sort of action is, of course, model theft. A company like Google or, let's say, Uber or Waymo
[37:45.510 --> 37:51.870]  may have spent billions or millions of dollars on coming up with a very accurate deep RL policy for
[37:51.870 --> 37:57.530]  autonomous navigation, and if it can be stolen by an adversary, then the intellectual property
[37:57.530 --> 38:04.330]  becomes worthless. And also, a stolen policy, an extracted policy, can be leveraged in integrity
[38:04.330 --> 38:14.230]  attacks in the same way that we mounted black box attacks on DQN policies. So, let's see.
[38:14.230 --> 38:19.610]  As it happens, a branch of reinforcement learning, or in general, one solution to the sequential
[38:19.610 --> 38:24.930]  decision-making problem, is not in the RL domain, but in the supervised learning domain, and it's
[38:24.930 --> 38:29.670]  called imitation learning. Imitation learning is the supervised learning of policies from observed
[38:29.670 --> 38:36.050]  behavior of an expert, and by behavior, I mean state action behavior. What is the policy of an expert?
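As a minimal sketch of that supervised framing, the following shows behavioral cloning in its simplest tabular form: count which action the expert took in each observed state and replay the most frequent one. This is not DQFD, and the names `clone_policy` and `default_action` are hypothetical; it assumes discrete, hashable states.

```python
from collections import Counter, defaultdict


def clone_policy(demonstrations, default_action=0):
    """Learn a state -> action lookup from (state, action) pairs by
    majority vote. Unseen states fall back to default_action."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in counts.items()}
    return lambda state: table.get(state, default_action)


# Hypothetical expert whose behavior we can only observe, not inspect.
expert = lambda state: state % 2
demos = [(s, expert(s)) for s in range(10)] * 3
cloned = clone_policy(demos)
```

With continuous or high-dimensional states the lookup table would be replaced by a classifier trained on the same pairs, which is closer to the setting discussed in the talk.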
[38:37.490 --> 38:47.390]  Based on this concept, in 2018, Hester et al. proposed DQFD, or Deep Q-learning from Demonstrations,
[38:47.390 --> 38:58.490]  which is DQN, where the initial training is done based on observed data using deep learning. So,
[38:58.490 --> 39:06.150]  they have data from human players playing a certain game, or human performance doing a
[39:06.150 --> 39:11.770]  certain task that they want the agent to learn. The initial step of training for a DQFD agent is
[39:11.770 --> 39:19.810]  supervised learning on observed data, and then it starts building on top of it through reinforcement
[39:19.810 --> 39:24.730]  learning approaches, and it was shown that it can result in faster convergence, better sample
[39:24.730 --> 39:32.330]  complexity, and sometimes more interesting and robust policies. As security researchers,
[39:32.330 --> 39:38.290]  you can probably see where this is going. This wonderful algorithm, DQFD, can also be used to
[39:38.290 --> 39:45.390]  replicate policies. Instead of applying it on observed data collected from human performance,
[39:45.390 --> 39:51.810]  it can be applied on observed data from a target policy. So, here's the proof-of-concept attack
[39:51.810 --> 39:59.430]  procedure. The attacker observes and records n interactions of the SARSA type, state action,
[39:59.430 --> 40:07.010]  the next state, and the reward based on this transition of the target agent in a particular
[40:07.010 --> 40:13.170]  environment, and then the attacker applies DQFD to learn an imitation of the target policy and
[40:13.170 --> 40:22.650]  Q function. Now, at this point, the attacker may either just go away and sell the extracted policy,
[40:22.650 --> 40:28.450]  or it may decide to target it using different adversarial perturbation attacks, some of which
[40:28.450 --> 40:35.370]  we've covered so far in this talk. So, as a proof of concept, we consider a slightly less complex
[40:35.370 --> 40:41.730]  environment, CartPole, where the objective is to stabilize this pole on this cart by moving
[40:41.730 --> 40:48.970]  the cart to right and left. The reason for choosing a simple environment is merely economical,
[40:49.990 --> 40:57.310]  because we didn't want the experiments to take days or weeks. We start with a simple case,
[40:57.310 --> 41:03.990]  and we consider different types of policies. DQN with prioritized replay, an enhanced version of
[41:03.990 --> 41:10.910]  the classical DQN, proximal policy optimization, and asynchronous actor-critic. And, of course,
[41:10.910 --> 41:18.730]  we also train an adversarial RL agent, a DQN agent, whose objective is to incur maximum loss
[41:18.730 --> 41:24.330]  of reward, or in other terms, in more technical terms, maximize the regret of its target.
[41:25.390 --> 41:29.590]  And here are the results. First, with regards to replication progress, based on only 5,000
[41:29.590 --> 41:36.650]  demonstrations, 5,000 state action, next state reward observations, we see that all three
[41:37.450 --> 41:46.670]  policies are almost exactly replicated. We can see convergence to the optimal performance
[41:47.350 --> 41:54.430]  of those policies in the environment. And then we perform adversarial training. We
[41:54.430 --> 42:01.610]  train an adversarial RL agent to attack and maximize the regret of those policies,
[42:01.610 --> 42:09.330]  and you can see that this can also be easily achieved within very few iterations of training
[42:09.330 --> 42:18.550]  in CartPole. I believe we can see that for PPO2, which is a somewhat robust deep RL
[42:19.510 --> 42:25.150]  algorithm or approach, it's possible to incur maximum damage or find a policy that incurs
[42:25.150 --> 42:33.310]  maximum regret on the target within 60,000 iterations, which is a relatively low amount.
[42:33.310 --> 42:44.210]  Now, what about defenses? So, of course, similar to adversarial examples, one approach or one
[42:44.210 --> 42:51.550]  technique for reducing the impact of adversarial example attacks is through regularization. And
[42:51.550 --> 42:57.130]  one common type of regularization in supervised learning for mitigating adversarial example
[42:57.130 --> 43:03.410]  attacks is adversarial training, training the model on adversarially perturbed samples to
[43:03.410 --> 43:09.870]  make sure that it sees different perturbations of the same image and knows that all of those
[43:09.870 --> 43:16.110]  result in the same label, in the correct label. This is called essentially data augmentation in
[43:16.110 --> 43:22.990]  general as a regularization technique. So in 2017, when I was just starting to look into this
[43:23.790 --> 43:30.450]  problem or this domain, I had gone through the adversarial training literature and thought that
[43:30.450 --> 43:36.490]  the same may also hold true for deep RL. I made a hypothesis, actually two. One is with regards
[43:36.490 --> 43:42.190]  to recovery. If training time attacks are not contiguous, if not all of the observations are
[43:42.190 --> 43:49.010]  perturbed, then DRL adapts to the environment and adjusts the policy to overcome the attacks.
[43:49.010 --> 43:55.750]  This is training time. And with regards to robustness, I made another hypothesis. Such policies,
[43:55.750 --> 44:04.650]  policies trained under attack are more robust to test time attacks. And this particular
[44:04.650 --> 44:08.910]  investigation is published in a paper titled whatever does not kill deep reinforcement
[44:08.910 --> 44:16.370]  learning makes it stronger. So similar as before, we are looking at DQN in Atari games, Breakout,
[44:16.370 --> 44:26.230]  Enduro, and Pong. Now, the way I designed the experiment was based on a probability of attack.
[44:26.230 --> 44:33.090]  So as an attacker, I assign a certain probability for each state, for each observation during
[44:33.090 --> 44:39.630]  training time to be perturbed. I perform experiments, different experiments with
[44:39.630 --> 44:45.670]  different values of this P attack, 20%, 40%, 80%, and 1, which means contiguous attack.
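The probabilistic attack schedule just described can be sketched as a per-observation coin flip. This is an illustrative sketch only: `maybe_perturb` and the placeholder `add_noise` perturbation are hypothetical names, and the actual experiments used adversarial-example perturbations rather than a constant offset.

```python
import random


def maybe_perturb(observation, p_attack, perturb, rng):
    """With probability p_attack return the perturbed observation,
    otherwise pass the clean one through. The flag reports which happened."""
    if rng.random() < p_attack:
        return perturb(observation), True
    return observation, False


rng = random.Random(0)
add_noise = lambda obs: [x + 0.1 for x in obs]  # placeholder perturbation

# Fraction of observations attacked over 10,000 training steps at P_attack = 0.4.
attacked = sum(
    maybe_perturb([0.0, 0.0], 0.4, add_noise, rng)[1] for _ in range(10_000)
)
```

Setting `p_attack` to 1 reproduces the contiguous case, since every observation is then perturbed.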
[44:45.670 --> 44:54.420]  And it's interesting to see that for values of P less than 50%, the agent actually recovers.
[44:55.260 --> 45:03.460]  But for values greater than 50%, the training process plummets and either does not converge or
[45:03.460 --> 45:16.340]  converges to a very, very low mean return value or mean total reward value. I later on published a
[45:16.340 --> 45:22.040]  theoretical analysis of why this happens. It's available in my PhD dissertation, which I'll
[45:22.040 --> 45:28.660]  reference in the final slide. Also, it was interesting to see that the robustness is also,
[45:28.660 --> 45:33.300]  the robustness hypothesis is also true. You can see here that after training,
[45:33.300 --> 45:40.780]  if we attack the test time policy with probability of 1, the plain or vanilla policy,
[45:41.000 --> 45:48.560]  a policy that was not trained adversarially, performs very poorly. However, policies trained
[45:48.560 --> 45:55.700]  under adversarial attacks with probability 0.2 or 20% and 0.4 perform really well. For 80%,
[45:56.320 --> 46:01.360]  it's even for policies trained at 80%, as you can see, the policy itself is already performing
[46:01.360 --> 46:08.880]  very poorly. It's surprising to see that at P equals 1, the performance gets slightly better.
[46:08.880 --> 46:14.960]  It's comparable with 40%. To this day, I'm not entirely sure why this happened. I've repeated
[46:14.960 --> 46:18.320]  the experiments a number of times and still get the same result. I still don't know why this
[46:18.320 --> 46:24.240]  happened. This is one of the interesting problems that we are looking at right now
[46:24.800 --> 46:30.160]  as a research group. Another defense that we introduce is based on parameter space noise.
[46:30.160 --> 46:36.960]  This is very much like dropout. The idea of parameter space noise was introduced in 2017,
[46:36.960 --> 46:43.080]  I believe, independently by Plappert et al. and Fortunato et al. And the idea here is, again,
[46:43.080 --> 46:47.780]  similar to dropout, to introduce zero mean random noise to the learnable parameters of neural
[46:47.780 --> 46:53.120]  network in deep RL to enhance exploration and convergence in deep RL benchmarks.
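In its simplest form, the idea of parameter space noise can be sketched as below. This is a deliberate simplification: it draws zero-mean Gaussian noise with a fixed scale `sigma`, whereas the published methods adapt or learn the noise scale. The helper name `noisy_parameters` is hypothetical.

```python
import random


def noisy_parameters(weights, sigma, rng):
    """Return a copy of the weights with independent zero-mean Gaussian
    noise of standard deviation sigma added to each parameter."""
    return [w + rng.gauss(0.0, sigma) for w in weights]


rng = random.Random(0)
weights = [0.5, -1.2, 3.0]                     # hypothetical learnable parameters
perturbed = noisy_parameters(weights, 0.1, rng)

# Averaged over many draws the noise cancels out, since it is zero mean.
draws = [noisy_parameters([0.0], 0.1, rng)[0] for _ in range(20_000)]
mean_noise = sum(draws) / len(draws)
```

The agent would then act with the perturbed weights, so that exploration comes from variation in the policy itself rather than from noise added to individual actions.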
[46:53.480 --> 47:02.880]  Now, in another paper in 2018, we investigate whether this approach can be used to mitigate
[47:02.880 --> 47:10.280]  the impact or severity of policy manipulation attacks on DQN, and it was shown that it
[47:10.280 --> 47:23.380]  actually performs very well compared to vanilla or classical DQN. And these are the training time
[47:23.380 --> 47:29.840]  results. You can see that if NoisyNet is used, if parameter space noise is used, the performance
[47:29.840 --> 47:35.320]  degradation for all environments is at a much lower slope, at a much lower rate than the
[47:35.320 --> 47:46.100]  vanilla architecture. Finally, to propose a solution for the policy extraction problem,
[47:46.980 --> 47:53.940]  I, along with William Hsu at K-State, came up with the idea of watermarking DRL policies.
[47:53.940 --> 47:59.860]  Watermarking has already been introduced in deep learning in general. The idea is to come up with
[47:59.960 --> 48:05.980]  a unique signature that is both difficult to remove and does not impact the performance of
[48:05.980 --> 48:13.180]  the policy itself, or the model itself, but it still provides a unique signature, proof that
[48:13.640 --> 48:22.980]  a model is the same as another model, or is a replica of the suspected model. So we introduce
[48:22.980 --> 48:27.560]  an interesting watermarking procedure. I know it's arrogant of me to call my own work interesting,
[48:27.560 --> 48:31.420]  but I still get excited when I think about the moment I came up with this idea. The idea is to
[48:31.420 --> 48:37.460]  create a second environment whose state space is disjoint from the main environment. So,
[48:37.460 --> 48:43.140]  create an environment. If you're training an agent to play a game, let's create another environment
[48:43.140 --> 48:51.500]  which has no states. None of these states are the same as the original training environment,
[48:51.500 --> 48:57.020]  or deployment environment of the agent. But the dimensionality of the states is the same. So,
[48:57.020 --> 49:02.640]  if each state in the original environment is represented by, let's say, three values,
[49:03.760 --> 49:08.740]  three features, then each state in the second environment is also represented by three
[49:08.740 --> 49:14.400]  features. It doesn't really matter what the second environment looks like. It's just some
[49:14.400 --> 49:21.380]  other environment that the agent may interact with. And then, we craft the transition dynamics
[49:21.380 --> 49:26.080]  and reward procedure for the second environment such that the optimal policy follows a looping
[49:26.080 --> 49:33.240]  trajectory. So, an optimal policy for an agent trained in the second environment is going to be
[49:33.240 --> 49:37.120]  one that follows a loop, goes through, let's say, state one, then state two, state three,
[49:37.120 --> 49:43.160]  and then goes back to state one. During training, what happens is, we periodically alternate
[49:43.160 --> 49:50.040]  between the two environments. So, let's say, at every N iterations of the training process,
[49:50.040 --> 49:55.540]  we take our RL agent from the original environment, we drop it in the second environment,
[49:55.540 --> 50:00.220]  we train it for a few iterations, and then bring it back to the original environment.
[50:00.460 --> 50:06.920]  Now, once trained, if we want to examine the authenticity or whether a policy is copied or not,
[50:06.920 --> 50:13.260]  we apply the policy in the second environment and measure the total reward. Here's the experimental
[50:13.260 --> 50:24.340]  setup. Again, we are working with CartPole. So, the watermarking environment is defined with
[50:24.340 --> 50:32.880]  five states, states one to four, plus a terminal state, which should never be reached if the
[50:32.880 --> 50:39.980]  policy is optimal in this environment. And none of these states, as represented here,
[50:39.980 --> 50:44.640]  can be found in the original, can occur in the original CartPole environment. These are
[50:44.640 --> 50:49.780]  all highly impossible, not highly, definitely and absolutely impossible to occur in the original
[50:49.780 --> 50:58.060]  environment. As for the transition dynamics, this is how we've defined it. Let a particular state be
[50:58.060 --> 51:10.400]  A0 and A1. If we are, I'm sorry, actions. Actions A0 and A1. If the agent is in state
[51:11.360 --> 51:22.220]  I and performs action I modulo two, if I is even, this is going to be action zero. If I is odd,
[51:22.220 --> 51:30.520]  this is going to be action one. Then the next state is going to be state I star or I modulo
[51:30.520 --> 51:37.460]  four plus one. So, and it receives a reward of one. If the agent performs any other actions,
[51:37.460 --> 51:43.080]  instead of going from one to two, it performs any other action, it will immediately go to the
[51:43.080 --> 51:47.560]  terminal state and receives a reward of zero. So, the optimal trajectory is state one to two to three
[51:47.560 --> 51:53.700]  to four back to state one and so on. All right. Let's see how it works. Let's look at the test
[51:53.700 --> 51:58.600]  time performance comparison of watermarked and nominal, non-watermarked policies. The
[51:58.600 --> 52:06.620]  watermarked policy performs exactly as well as unwatermarked policies. It reaches the
[52:07.500 --> 52:17.840]  optimal or best performance of 500. And when it's applied to the watermark environment,
[52:17.840 --> 52:23.800]  the second environment, it also performs optimally. It gets the maximum reward possible.
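The watermark environment's transition rule described above can be sketched as follows. This is a minimal illustration that uses abstract integer labels for the states; in the experiment each state is a feature vector with the same dimensionality as a CartPole state but outside its reachable range. The class name `WatermarkEnv`, the terminal label `0`, and the helper `total_reward` are hypothetical.

```python
class WatermarkEnv:
    """Four looping states (1..4) plus a terminal state."""

    TERMINAL = 0  # label chosen here for the terminal state

    def __init__(self):
        self.state = 1

    def reset(self):
        self.state = 1
        return self.state

    def step(self, action):
        # In state i the correct action is i mod 2; it leads to
        # state (i mod 4) + 1 with reward 1.
        if action == self.state % 2:
            self.state = self.state % 4 + 1
            return self.state, 1, False
        # Any other action ends the episode with reward 0.
        self.state = self.TERMINAL
        return self.state, 0, True


def total_reward(policy, steps=20):
    """Run a policy in the watermark environment and sum its rewards."""
    env = WatermarkEnv()
    state, total = env.reset(), 0
    for _ in range(steps):
        state, reward, done = env.step(policy(state))
        total += reward
        if done:
            break
    return total


watermarked = lambda s: s % 2      # follows the loop 1 -> 2 -> 3 -> 4 -> 1
non_watermarked = lambda s: 0      # a policy that never learned the loop
```

A policy that carries the watermark keeps looping and collects one reward per step, while a policy that was never trained in this environment is very likely to fall into the terminal state early, which is the signal used for the authenticity check.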
[52:24.180 --> 52:32.620]  However, you can see that if we try to apply the non-watermarked policies to the watermark
[52:32.620 --> 52:40.560]  environment, we'll see very, very small values of score, the total reward. So, you can see that
[52:40.560 --> 52:46.520]  it's possible to determine whether a policy is authentic or not or whether a policy is an exact
[52:46.520 --> 52:55.000]  copy of another policy by just applying it to a second environment and seeing whether it performs
[52:55.000 --> 53:00.260]  optimally or not. Now, there are many other things that I really wanted to touch upon
[53:00.260 --> 53:06.360]  in this talk, but unfortunately we are a little short in time. For practitioners, it may be
[53:06.360 --> 53:11.100]  of interest to have some way of benchmarking or evaluating the resilience and robustness
[53:11.100 --> 53:16.260]  of policies and comparing different policies, different approaches with regards to the
[53:16.260 --> 53:22.780]  resilience and robustness. Some of my work already introduces or proposes an RL-based
[53:22.780 --> 53:30.280]  approach to perform this evaluation and benchmarking. I've also done some work on
[53:31.180 --> 53:35.480]  investigating the impact of hyperparameter choices on resilience and robustness of
[53:35.480 --> 53:40.720]  DQNs in particular, but also other model-free and actor-critic approaches.
[53:41.920 --> 53:46.860]  This can be very helpful to those who want to engineer and design the new RL
[53:46.860 --> 53:53.680]  agents to be deployed in critical environments. Also, something that I wanted to mention, but
[53:53.680 --> 54:02.020]  unfortunately I don't have time to do so, is that adversarial training is not a silver bullet. It's
[54:02.020 --> 54:07.160]  not an answer to all of the problems in DQN agents. There are certain limitations
[54:09.200 --> 54:14.340]  for robustness and resilience obtained from adversarial training of DQN agents.
[54:14.340 --> 54:21.720]  And also adversarial training is very costly in general, especially when it comes to real-world
[54:21.720 --> 54:29.480]  scenarios, real-world environments and actions. And some of my recent work is focused on improving
[54:29.480 --> 54:35.920]  the sample efficiency and computational cost of adversarial training via a new exploration
[54:35.920 --> 54:41.240]  mechanism called adversarially-guided exploration, or AGE. All of this work can be found
[54:41.940 --> 54:48.360]  in my Ph.D. dissertation, which bears the same name as this talk, Faults in Our Pi Stars: Security
[54:48.360 --> 54:53.480]  and Deep Reinforcement Learning. You can find it if you search my name on Google Scholar.
[54:53.940 --> 54:58.860]  If you're interested, of course, all of these are published in separate papers in slightly
[54:59.560 --> 55:08.680]  more detail under the same titles. And finally, some of the open areas of research in this domain.
[55:09.400 --> 55:15.400]  With regards to training time resilience and robustness, not much has been done with regards
[55:15.400 --> 55:20.320]  to policy search and actor-critic methods, as well as model-based and hybrid methods. Of course,
[55:20.320 --> 55:23.880]  when we talk about model-based, there are some approaches from optimal control theory and
[55:25.620 --> 55:33.140]  approximate dynamic programming that may be applied here, but very few have looked at this
[55:33.140 --> 55:37.760]  problem from a security point of view. So if you're interested, this is one of the areas that
[55:37.760 --> 55:44.260]  is in dire need of a security-oriented investigation. As for mitigation of policy
[55:44.260 --> 55:47.500]  replication, one of the ideas that my research group is currently working on is constrained
[55:47.500 --> 55:54.280]  randomization of policy. So randomize the policy such that the replication through techniques like
[55:54.280 --> 55:58.660]  imitation learning becomes more costly, more samples, more observations will be required,
[55:58.660 --> 56:06.100]  while preserving the performance of the policy. There's almost no work done in multi-agent
[56:06.100 --> 56:10.600]  settings. Of course, adversarial reinforcement learning has been investigated in settings where
[56:10.600 --> 56:15.600]  there are zero-sum agents, but not really where there is an external adversary or adversaries
[56:15.600 --> 56:21.260]  trying to exploit the inner workings of the agents, the RL components of the agent.
[56:21.960 --> 56:27.420]  One more thing that is of note is the importance of discounting. The addiction problem that I
[56:27.420 --> 56:34.840]  demonstrated earlier in the snake agent is mostly due to the constant discounting solution. For
[56:34.840 --> 56:37.460]  those of us who come from a reinforcement learning background, you're familiar with
[56:37.460 --> 56:43.060]  the basics of reinforcement learning. You probably know that the discount factor is typically chosen
[56:43.060 --> 56:50.040]  to be 0.99 or something in the same ballpark and is left the same. It's treated as a constant
[56:50.040 --> 56:56.460]  throughout the training process. But this is very far from how our brain works and very far from
[56:57.860 --> 57:03.060]  the optimal approach or accurate approach to discounting. Our research group has recently
[57:03.060 --> 57:08.140]  started looking into this problem and is working on developing adaptive discounting
[57:09.320 --> 57:16.940]  solutions to enhance the resilience and robustness of RL agents, particularly
[57:16.940 --> 57:21.590]  deep RL agents in complex environments for AI safety and security purposes.
[57:21.960 --> 57:26.200]  Now, of course, there are naturally inspired approaches that can be looked at. For example,
[57:26.200 --> 57:33.400]  approaches coming from, let's say, TD lambda models or dopamine models of psychopathological
[57:33.400 --> 57:38.200]  problems or neurological problems and the solutions prescribed to those, as well as
[57:38.200 --> 57:43.580]  approaches in social sciences, which may help with the security problems arising in multi-agent
[57:43.580 --> 57:51.600]  RL settings. Very well, thank you very much. And I believe at this time I should be available for
[57:51.600 --> 57:53.140]  your questions.
